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Abstract 


Assessing treatment effects in observational studies is a multifaceted problem that not only involves 
heterogeneous mechanisms of how the treatment or cause is exposed to subjects, known as propen- 
sity, but also differential causal effects across sub-populations. We introduce a concept termed the 
facilitating score to account for both the confounding and interacting impacts of covariates on the 
treatment effect. Several approaches for estimating the facilitating score are discussed. In par- 
ticular, we put forward a machine learning method, called causal inference tree (CIT), to provide 
a piecewise constant approximation of the facilitating score. With interpretable rules, CIT splits 
data in such a way that both the propensity and the treatment effect become more homogeneous 
within each resultant partition. Causal inference at different levels can be made on the basis of 
CIT. Together with an aggregated grouping procedure, CIT stratifies data into strata where causal 
effects can be conveniently assessed within each. Besides, a feasible way of predicting individual 
causal effects (ICE) is made available by aggregating ensemble CIT models. Both the stratified 
results and the estimated ICE provide an assessment of heterogeneity of causal effects and can be 
integrated for estimating the average causal effect (ACE). Mean square consistency of CIT is also 
established. We evaluate the performance of proposed methods with simulations and illustrate their 
use with the NSW data in Dehejia and Wahba (1999) where the objective is to assess the impact of 
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a labor training program, the National Supported Work (NSW) demonstration, on post-intervention 
earnings. 


Keywords: CART, causal inference, confounding, interaction, observational study, personalized 
medicine, recursive partitioning 


1. Introduction 


Comparative studies that involve evaluation of the effect of an investigational treatment or a putative 
cause on an outcome variable are fundamental in many application fields. The data may come from 
either a designed experiment or an observational study. Regardless of the data sources, two major 
issues exist when assessing the treatment effect: confounding and interaction effects of covariates. 


A confounding variable or confounder is an extraneous covariate that relates to both the treat- 
ment and the response and hence influences the treatment effect estimation. Controlling or adjusting 
for confounders can be done in either design or analysis. In designed experiments, randomization, 
matching, cohort restriction, and stratification are commonly-used ways to effectively control for 
confounding variables. However, observational studies are often the only available choice due to 
ethical or practical considerations. Causal inference with observational data is particularly challeng- 
ing. The main obstacle is the nonrandom treatment assignment mechanism, in which the subjects 
select a treatment that they believe best serve their interests or are exposed to a treatment according 
to individual traits. As a result, systematic imbalance or heterogeneity may exist between individu- 
als in the treated group and those in the control group. Thus it is crucial to control for confounders 
in the analysis stage of such data. Common approaches include analysis of covariance (ANCOVA), 
propensity score methods (Rosenbaum and Rubin, 1983), and directed acyclic graphs (DAGs; Pearl 
2000 and Spirtes, Glymour, and Scheines 2001). Even with randomized experimental data, covariate 
imbalance can also be revealed when examining data in a multivariate manner. Consider a hypo- 
thetical example where m older women and m younger men are assigned to the treated group while 
m older men and m younger women are assigned to the control group. The data appear to be per- 
fectly balanced in terms of either age or gender, despite the perfect imbalance at their combination 
levels. When the dimension of covariates gets high, each experimental unit essentially represents 
an unique individual that is not replicable, which makes randomization less relevant. This partially 
explains why covariate adjustment is practiced even with randomized experimental data. Associ- 
ated with variable selection issues, additional challenges present themselves in the form of over- 
or under- adjustment when confounders are incorrectly identified. For example, under-adjustment 
occurs when an important confounder is uncollected in the data or excluded from the model. On the 
other hand, some intermediary outcome variables, often referred to effect-mediators, are important 
in understanding the mechanism how and why the treatment becomes effective. As an example 
of over-adjustment, the treatment effect would be under-estimated when a mediator is mistakenly 
considered as a confounder and included in the model for adjustment. Over-adjustment also may 
occur when controlling for a collider that correlates with both the treatment and the outcome via an 
‘M-diagram’ (Greenland, 2003). 

In terms of influence of covariates on treatment effect assessment, another equally important 
issue is interaction, also known as effect modification or effect moderation (see, e.g., VanderWeele 
and Robins 2007 and VanderWeele 2009), which is concerned with differential treatment effects 
at different levels or values of covariates. An effect modifier is a covariate that interacts with the 
treatment and changes the direction and/or degree of its causal effect on the outcome. Existence of 
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interaction complicates model interpretation. Detection of interaction is challenging. While inter- 
actions are mostly formulated via cross-product terms in a linear model and restricted to be of the 
first- or second-order, complex nonlinear or higher-order interactions may exist in reality. It is also 
important to distinguish between qualitative interactions and quantitative interactions. Qualitative 
interaction (Gail and Simon, 1985) occurs when there is a directional change in terms of treatment 
preference, a cause of greater concern to practitioners. Closely related to treatment-by-covariate in- 
teractions, subgroup analysis (see, e.g., Lagakos 2006) is an integral part in the analysis of clinical 
trials. Practitioners and regulatory agencies are keen to know if there are subgroups of trial partic- 
ipants who are more or less likely to be helped or harmed by the intervention under investigation. 
Subgroup analysis helps explore the heterogeneity of the treatment effect across sub-populations 
and extract the maximum amount of information from the available data. On the other hand, sub- 
group analysis is subject to malpractice owing to difficulties in subgroup determination, multiple 
testings, and lack of power. The new stimulating concept of personalized medicine or personalized 
treatments (see, e.g., Jain 2009) is intended to refine the traditional medical decisions by capitalizing 
on results of subgroup analysis or the knowledge of individualized treatment effects. Nevertheless, 
sorting out differential causal effects often entails large data that are collected at post-trial periods, 
for example, the Medicare or Medicaid databases. 

Assessments of confounding and interaction intervene with each other. First of all, confound- 
ing emerges as one primary issue in the assessment of the main effect of treatment, also known as 
the average causal effect (ACE). However, ACE implicitly assumes homogeneity or unimportant 
heterogeneity of causal effects. When strong treatment-by-covariate interaction exists, ACE may 
become less practically useful. This is the case especially when the interaction is qualitative. Sup- 
pose, for example, that the treatment effect is 6 for half of the data (say, males) and —6 for the 
other half (say, females), both having important scientific implications. The ACE in this case is 
null. When solely based on ACE, one would arrive at the misleading conclusion that the treatment 
does not have an effect. On the other hand, when the estimation bias caused by inadequately han- 
dled confounders gets overwhelming, it may be disguised as differential treatment effects. We shall 
illustrate more on this point later with simulation in Section 4. Therefore, it is crucial to have both 
confounding and interaction well addressed in comparative analysis. 

Rubin’s causal model (Rubin, 1974, 1977, 1978, 2005) provides a general framework for making 
these assessments, within which the treatment effect is finely calibrated at three different hierarchi- 
cal levels (i.e., unit, subpopulation, and population) using a counterfactual model and the concept 
of potential outcomes. In this article, causal inference is explicitly reformulated as a predictive 
modeling problem within the framework of Rubin’s causal model. To approach, we introduce a 
concept, termed facilitating score, to address both the confounding and interacting impact of ex- 
traneous variables on causal inference. Conditional on the facilitating score, homogeneity can be 
achieved in both the assignment mechanism and and the effect of the treatment. Then we put for- 
ward a causal inference tree (CIT) procedure, to approximate the facilitating score with a piecewise 
constant function. CIT recursively splits data into disjoint groups in such a way that both treatment 
assignment mechanisms and the treatment effects become more homogeneous within each group. 
On the basis of CIT, a group of recursive partitioning methods are devised to make causal inference 
at different levels. 

The remainder of this paper is arranged in the following manner. In Section 2, following an 
outline of Rubin’s causal inference framework, the concept of facilitating score is introduced and 
methods for estimating the facilitating score are discussed. Section 3 presents the CIT methodology 
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in detail. Section 4 contains simulation studies that are designed to investigate the performance of 
CIT. An illustration is provided via a real data example in Section 5. In Section 6, we extend the 
results to situations where the treatment variable is ordinal or nominal. Section 7 ends the article 
with a brief discussion. 


2. Facilitating Scores 


We first review Rubin’s causal models, then we introduce the facilitating score concept and discuss 
methods for estimating the facilitating score. 


2.1 Causal Inference 


In Rubin’s causal model (Rubin, 1974, 1977, 1978, 2005), a fine calibration of treatment effect is 
facilitated by a comparison between the observed outcome on an individual or unit and the potential 
outcome if the individual had been assigned to the counterfactual treatment group. Adopting his 
notations, let Q = {@} be a finite population with N units, endowed with a probability measure P 
that places uniform mass 1 /N on each unit. Let T = T (œ) be a binary treatment assignment variable 
with value 1 if unit œ receives the putative treatment and 0 otherwise. While the term ‘treatment 
assignment’ or ‘selection’ is best suitable for designed experiments, we shall use it throughout this 
article. In addition, let X = X(@) be a p-dimensional vector of measured covariates for unit @. 

Let Yo = Yo(@) be the response that would have been observed if unit œ were assigned to the 
control group and let Y; = Yı (Œœ) be the response that would have been observed if unit œ received 
the treatment. These two variables are called potential outcomes (Neyman, 1923). In reality, either 
Yo(@) or Yı (œ), but not both, can actually be observed depending on the value of T(@), an inber- 
ent fact called the fundamental problem of causal inference (Holland, 1986). Thus the observed 
outcome is 


¥() = {(1—T(o)}%o(@) + T(@)¥i (0). 


Throughout this paper, we consider random sampling from Q so that {@,...,@,} forms an indepen- 
dent and identically distributed (iid) sample of size n. The available data {(yj,t;,x;) = 
(y(@;),¢(@;),x(@;)) : i = 1,...,n} consist of n realizations of Y, T, and X. For the sake of sim- 
plicity, we sometimes omit unit œ from the notations. 

Causal inference is concerned with the comparison of the two potential outcomes via the ob- 
served data. Holland and Rubin (1988) distinguished three levels of causal inferences: unit level, 
subpopulation level, and population level. The lowest level of causal inference is a comparison of 
Yo(@) and Yı (œ), typically the difference Y;(@)— Yo(@), for each unit œ. Subpopulations can be 
formed by restricting the values of covariates to a partition of Q. The causal effect in a subpopula- 
tion {0 : X(@) € B} is E(Y1|X € B) — E(Yo|X € B) for some Borel set B in the predictor space X. 
The average causal effect (ACE) over the entire population Q is E(Y;) — E (Yo). These three levels 
form a hierarchy of causal inference in decreasing order of strength, in the sense that knowledge of 
upper-level causal inferences can be inferred from that of lowered-level causal inferences, but not 
vice versa. A preponderance of the literature in causal inference is centered on schemes for making 
the population-level inference or estimating ACE under various scenarios. 

Rosenbaum and Rubin (1983) introduced the concept of balancing score to tackle the confound- 
ing issue in causal inference. A balancing score b(x) accounts for the dependence between X and 
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treatment assignment or selection 7; that is 
X LL T|b(X). 


Treated and untreated subjects sharing the same balancing score tend to have the same distribution 
of covariates. Various covariate adjustment techniques implicitly adjust for an estimated scalar 
balancing score. They showed that the propensity score 


ex) HPT =1|X =x), 


which is defined as the conditional probability of assignment to the treated group given the measured 
covariates X, is the coarsest balancing score. Namely, b(x) is a balancing score if and only if b(x) 
is finer than e(x), that is, e(x) is a function of b(x). 

Propensity score based matching, stratification (or subclassification), and adjustment have been 
extensively used to balance the discrepancy in covariates between the treatment groups in the as- 
sessment of ACE. In propensity score analysis, the assumption of strong ignorability plays a pivotal 
role. Similar to that of missing at random (MAR) in the missing data literature (Rubin, 1976), this 
assumption states that P(T |X, Yo, Y1) = P(T |X) or, 


T 11 (¥%,¥1) |X. 


It is possible that strong ignorability is violated even there are no unmeasured variables that are 
direct causes of any pair of measured variables. See, for example, Greenland (2003) for more 
discussions. It is worth noting that this assumption does not imply that T LL Y |X. To illustrate, 
consider a simple example where the causal effect at the unit level is constant, namely, Yı (œ) — 
Yo(@) = ô for any w. Suppose that Yo = f(X) +£ and Y; = f(X)+6+¢, where £ LL X is the error 
term. It follows that Y = 57 + f(X) +e. The ignorability assumption amounts to € LL T |X, which, 
by no means, implies Y LL T |X. 

Under this assumption of strong ignorability, Rosenbaum and Rubin (1983) established that 
(Yi, Yo) LL T|b(X) when 0 < e(X) < 1.It follows that 


E(Y|b(X),T = 1) —E(¥o|b(X), T = 0) = E(Y1|b(X)) — E(Yo|b(X)). (1) 


Therefore, the population-level causal interpretation may be achieved by averaging over the distri- 
bution of b(X), 
E(Y, — Yo) = Epyx {E (M1 |b(X)) — E(Yo|b(X)) f- (2) 


Equations (1) and (2) provide the basis for propensity score based methods. 


2.2 Facilitating Score 


Parallel to confounding, interaction is concerned with differential causal effects among units or sub- 
populations. It is important to note that both Equation (1) and (2) involve a reduction of hierarchy in 
causal inference, where individual-level inferences are integrated to make subpopulation-level infer- 
ences on Q; = {0 : b(X(@)) = b} or sub-population level inferences are reduced to the population- 
level inference on Q. Such a reduction may not be taken for granted, because it implicitly assumes 
homogenous lower-level causal effects. Specifically, if substantial differences in causal effects are 
present at a lower level of inference, then transition to an upper-level inference may not be plausible 
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and conclusions based on upper-level causal effects can be misleading. This can be particularly 
problematic when qualitative interactions exist. 
To gain insight, note that with balancing score b(X), 


X LL (Yo, V1) | b(X). 


As a result, 6,(X) = E(Yi|b(X) = b) — E (Yo|b (X) = b) in (2) is not a constant, but a function of 
X within the subpopulation Q,. If p(X) varies substantially with X, we say that a treatment-by- 
covariate interaction exists. In this case, the overall causal effect 6, in Q, becomes less pertinent 
as it implicitly assumes that p(X) can be reduced to a constant õp. A fine delineation of treatment 
effect (X) at the individual level is desirable in the efforts of advancing personalized medicines. 
Even if estimating 5, is of interest, it cannot be summarized by direct comparison of treatment 
means. Instead, it should be obtained by integrating over the distribution of X in Q,, that is, 6, = 
Jo, 5b(X) du(x). Direct comparison of treatment means in Q, makes another implicit assumption 
that, within Q,, X follows a uniform distribution. The same problem remains when using (2) for 
ACE estimation. 

It is therefore critical to take both heterogeneous treatment assignment mechanisms and dif- 
ferential treatment effects into consideration when assessing the treatment effects. We introduce a 
concept termed facilitating score to address these two issues simultaneously. 


Definition 1 A facilitating score ay(X) is a qo-dimensional (0 < qo < p) function of X such that 
X LL (Yo,¥1,T) | ao(X). 


In this definition, the joint independence between X and (Yo,¥1,7) given ag(X) can be relaxed as 
two marginal independence conditions: X LL T |ao(X) and X LL (Y,Y1)|ao(X), which separately 
address the confounding effect and the interacting effect of X. But, if strong ignorability, that is, 
T LL (Y%,¥1)|X, is further assumed, it follows that T LL (¥o,¥1)|ao(X) and hence the marginal 
independence implies the joint independence as well. Existence of aọ(X) is guaranteed, since X 
itself can be regarded as a facilitating score. 

Nevertheless, Definition 1 places strong requirements on ag(X). Estimating the facilitating 
score essentially involves jointly modeling {Yo,Y,,7 } conditional on X, which is unworkable since 
(Yo, Yi) can not be observed at the same time. To get around this difficulty, we next consider a 
weaker definition of facilitating score that is more practically useful. 


Definition 2 A weak facilitating score a(X) is a q-dimensional (0 < q < p) function of X such that 
(i) X LLT |a(X) and (ii) E(Y, — Yo|X) = E (Yı — Yola(X)). 


By condition (i), a weak facilitating score a(X) must be a balancing score; by condition (ii), any 

effect moderation owing to X can be fully represented by a(X). Condition (ii) is equivalent to saying 

that E(¥, — Yo|a(X) = a) is independent of X. However, this does not necessarily imply that 
E(Y|X) = E(Yı|a(X)) and £(¥|X) = E(Yola(X)). 3) 


There could exist a common function g(X) that has been cancelled out in Condition (ii). Namely, 


g(X) = E(Y|X) — E (Yi |a(X)) = E (Yo|X) — E(Yola(X)). 
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A facility score must also be a weak facilitating score, but not vice versa. We use the term ‘fa- 
cilitating’ because conditioning on a(X) helps facilitate causal inference, in the sense that causal 
inference within the sub-population Qa = {0 : a(X(@)) = a} can be conveniently obtained via di- 
rect comparison of sample mean responses. This is because both propensity and the treatment effect 
õa become constant within Qa. 

Since the propensity e(X) is the coarsest balancing score, it follows that e(X) = e in Qa. In 
some scenarios, e(X) is explicitly a separate component of a(X), as exemplified by the parametric 
approach outlined in Section 2.3; but this is not necessarily true in general, as exemplified by the 
semi-parametric approach outlined in the same section. In terms of stratification, Qa provides ad- 
ditional refinements of Q. = {0 : e(X(@)) = e} in order to achieve homogeneous within-stratum 
treatment effects. 


Theorem 3 Suppose that the conditional joint density of (Y,T) given X, fyr\x(¥,T|X), can be 
written as fy r\x(¥,T|X) = g(Y,T,h(X)) for some function g(-). In other words, (Y,T) 1. X|h(X). 
Assuming that treatment assignment is strongly ignorable, h(X) is a weak facilitating score when 
0<e(X) <1. 


We defer the proof of Theorem 3 to Appendix A, where it is established as a special case of a more 
general result in Theorem 7. Theorem 3 basically states that both confounding and interacting effect 
of X on causal inference with the potential outcomes (Y1, Yo) can be handled by working with the 
observed data (Y,7,X). More specifically, if the joint density of (Y,T) given X can be accounted 
for by a vector-valued function h(X), that is, (Y,7) LL X|h(X), then h(X) is a weak facilitating 
score. Besides, it can be shown that Equation (3) holds for h(X), that is, E (Y1 |X) = E (Yı |h(X)) and 
E(Y|X) = E(Yo|h(X)). This condition will be relaxed in Section 2.3. 

Estimation of h(X) involves modeling the joint distribution of (Y,7) given X. Searching for a 
satisfactory h(X) is not an easy task; we have to look for approximate solutions. On the other hand, 
it is no longer unattainable as the involved elements (Y,7,X) are all observed. Although h(X) is 
generally set as vector-valued, its dimension should be small in order to be practically useful. 


2.3 Estimating the Facilitating Score 


We shall discuss three proposals for finding useful approximations of h(X), which are parametric, 
semiparametric, and nonparametric in nature, respectively. While they are all methodologically 
interesting, we deem the nonparametric approach most practically useful. 

The first method is parametric. Consider 


fY,T|X) = f(V|T,X)- f(T|X) 
{FOIT =1,X)}" {fT =0,X)}" -f(TIX) (4) 
by Bayes’ rule. With a parametric approach, we assume a model for each of the terms in (4): 
propensity score model for f(T |X) and outcome regression models for f (Y |T =0,X) and f(Y|T = 


1,X). It is convenient to model T|X with logistic regression and model Y|(7,X) with Gaussian 
linear regression so that 


JETE = Eo (2E) «fetta xy} a), 6) 
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where o is the constant error variance; 
u = E(Y|T,X) =Yot MT +h (X) +T -h2(X); (6) 


(-) is the density function of the standard normal N (0,1) distribution; and m(x) = exp(x)/(1 + 
exp(x)) is the logistic or expit function. 


Proposition 4 Suppose that the propensity model can be specified by e(X) = e(h3(X)) as in (5) and 
the conditional mean response given (T,X) is formulated by (6). Under the assumption of strong 
ignorability, h(X) = (h2(X),h3(X))! is a weak facilitating score. 


The proof is provided in Appendix B. Proposition 4 says that hı (X) is not a necessary component 
of a weak facilitating score. It holds as long as the conditional mean outcome is specified by (6); in 
other words, normality is not needed either. Besides, note that Equation (3) is not required with this 
definition of h(X). 

To continue with the parametric approach, linearity is further enforced so that hj(X) = Bix j for 
j =1,2,3, where X; contains selected components of X. The involved parameters © = {B,y,o} can 
be estimated via maximum likelihood in a straightforward manner. The likelihood function is 


71, yio £ tot toalet ae 
L(0) = I] = (5#) [rGsx)} {1—2(B5x;)} | = LL. (7) 


Clearly there is a variable selection issue involved. Note that (8,,B,) are involved only in Lı for 
the outcome regression model while B; is involved only in L3 for the propensity score model. This 
property not only simplifies the likelihood optimization, but also allows for variable selection to be 
performed separately for the propensity model and outcome regression models. 

With an estimated h(x) = (Box, Bx)! , data can be stratified via combined use of the medians 
or terciles of its components, similar to propensity score subclassification. While this parametric 
method provides a feasible approach for stratification, there are several difficulties in practice. First 
of all, it is a two-step approach. The final results rely on correct model specifications. Moreover, 
the number of strata has to be rather arbitrarily determined. The fact that h(x) is vector-valued 
contributes added difficulties to execution. In particular, as the dimension of h(x) increases, the 
number of strata grows precipitously. Even with only two categories induced by each component, 
there are 24 subclasses for a g-dimensional h(x). 

Another intuitive semi-parametric approach to estimate h(X) is via dimension reduction tech- 
niques. In view of (Y,7) LL X|h(X), if it is further assumed that h(X) is linear in X so that 
h(X) = BX, then the subspace spanned by columns of B, S(B), is called the dimension-reduction 
subspace that accounts for the conditional distribution of (Y,7) given X. Let Siy,r)|x denote the in- 
tersection of all dimension-reduction subspaces. Under some regular assumptions, S,y,7))x is also a 
subspace, termed the central dimension-reduction subspace or central space. Sliced inverse regres- 
sion (SIR; Li 1991) and its variants can be used to estimate Sy r)jx. While further research efforts 
are needed in handling the bivariate response (Y,T), there is no additional conceptual complication 
involved. For example, one convenient approach is to first introduce (2S) slice indicator variables 


Za =H Ok <Y Sy)A(T =t)}, 
where s = 0,1,...,S; t = 0,1; and {—% = yọ < y, <+ < yg = +00} are pre-specified grid points 
that define S slices for Y. Then the sliced regression method (Wang and Xia, 2008) can be applied 
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to estimate the central mean space of Z = (Zs), which approximates the central space Svy,r)\x. 
Nevertheless, the same above-mentioned difficulties as in the parametric approach remain when it 
comes to stratification on the estimated linear facilitating scores. 

In the next section, we consider yet another recursive partitioning based nonparametric alterna- 
tive, which seems to provide a more satisfactory solution to the problem. Hereafter, we refer to this 
method as CIT for causal inference tree. CIT combines estimation of h(x) and data stratification 
into one step. On the basis of CIT, we devise methods for making causal inference at different 
levels. 


3. Causal Inference via Recursive Partitioning 


Tree-based methods (Morgan and Sonquist 1963 and Breiman et al. 1984) approximate the under- 
lying function of interest with piecewise constants by recursively partitioning the predictor space. 
At the same time, a tree structure offers natural grouping of data with easily interpretable splitting 
rules. With an automated algorithmic approach, CIT seeks disjoint groups that have homogeneous 
joint density of (Y,7) within each. The resultant grouping rules, which are induced by binary splits 
on the covariates X, are meaningfully interpretable, implying a nonparametric estimation of the 
facilitating score. 

In this section, we first follow the CART (Breiman et al., 1984) convention to construct one 
single CIT, which consists of three steps: growing a large tree and selecting the optimal subtree via 
pruning and cross validation. On the basis of CIT, methods for causal inference at different levels 
are then developed. CIT itself provides a natural stratification of data for subpopulation inference. 
An aggregated grouping method is introduced in order to enhance its performance. Conditional 
inference at the individual unit level can also be obtained by combining results from ensemble 
CIT models. Both stratified and individualized causal effect estimates can help depict variations 
in propensity and treatment effects and make available a natural evaluation of the plausibility of 
treatment comparability and ACE assessment. These results can also be integrated for estimating 
ACE estimates. Finally, we establish the mean square risk consistency of CIT under conditions 
similar to those in CART (Breiman et al., 1984). 


3.1 Causal Inference Trees (CIT) 


A tree model can be expressed as a graph with connected nodes, each node corresponding to a subset 
of the data. We use J as a generic notation for a tree structure and T for a node. In tree modeling, 
the effects of X are exclusively explained by the splitting rules. To start the tree construction, 
we consider one single split of data. When restricted to a node 7, the distribution of (Y,7) no 
longer depends on X, implying a constant propensity and a constant treatment effect. Following 
decomposition of the joint density fz(¥,T) = f:(Y |T) f(T) within node 7, it is convenient to impose 
that 


T ~ Bernoulli (t+) and Y|T ~ N {u=(1—T)uo+T pri, 02}. 
We would like to comment that recursive partitioning can be viewed as a localized approach with 
local optimality achieved at each split. In local areas, the model needs not to be complicated and 


often employs a parametric form. The procedure starts with splits that are built upon something 
that is relatively simple and then evolves into a comprehensive model by recursively bisecting. 
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The resultant tree model is nonparametric in nature and relatively robust to local distributional 
assumptions. 
The associated log-likelihood function becomes 


. — i 2 
Miceli Hi) + nq Int, + noln(1— mr) 





h= 2 In(270?) 


where {n;, n0, } are the total number of observations in node t, the number of observations 
in node T that are assigned to the control group, and the number of observations in node T that 
are assigned to the treatment group, respectively. Maximum likelihood estimates of the involved 
parameters are explicitly available: %, = nq /n, Peo = Yro, Ati = Yu, and ô? = SSE, /nz, where 


SSE, = >. (yi =Fy)° =F »: (yi — 7o), 
{iet: ti=1} {iet: ti=0} 


and {¥0,¥r1} are the sample average responses of the control and treatment groups in node T, 
respectively. Up to a constant, the maximized log-likelihood function in node T is 


A n 
lh œ 0 (nr: SSEz) +nq lnn +n lnnzp. (8) 


Note that we have assumed a mean-shift Gaussian model with the same constant variance for the 
causal effect. If different variances are considered, the final form of Î; would be slightly different. 

Without loss of generality, we consider binary splits only. When a split s bisects node Tt into the 
left child node tz and the right child node Tz, the associated likelihood ratio test statistic is 


LRT (s) = 2 z (i + fr — it), (9) 


where the maximized log-likelihood score for nodes Tz and Tp, f and les can be obtained in the 
same manner as /, in (8). The LRT; can be used as the splitting statistic to select the best split. After 
removing irrelevant components, we have 


LRT (s) ox —n; /2-In (ne, SSE) — nrg /2- 1n (Nrg SSErg) + 


Nay 1 Innz,1 + 77,0 Innz,o +Nzp1 Innri + Argo Inno- 


The best split s* is the one that yields the maximum LRT (s) among all allowable splits. Accordingly 
node T is split into Tz and Tr using s*. Subsequently, a similar procedure is applied to split either of 
Tz and Tr. We repeat the procedure until some mild stopping rules are satisfied. This process results 
in a large initial tree, denoted as Jp. 

The final tree model is a subtree of Jọ. Nevertheless, it is practically infeasible to examine 
every subtree because the number of subtrees increases rapidly as the number of terminal nodes 
in J increases. The idea of pruning is to provide a subset of candidate subtrees by iteratively 
truncating off the ‘weakest link’ of Jj. There are several pruning algorithms available, including the 
cost-complexity pruning of CART (Breiman et al., 1984) for trees that are built upon minimizing 
within-node impurity, the split-complexity pruning of LeBlanc and Crowley (1993) for trees that 
are built upon maximizing between-node differences, and the AIC pruning of Su, Wang, and Fan 
(2004) for trees that are built within the maximum likelihood framework. Since CIT is essentially 
likelihood based, the AIC pruning is adopted for direct use. We shall keep our descriptions concise 
by referring the reader to appropriate references for greater details. 
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In the AIC pruning algorithm, the performance of a given tree J is measured by the Akaike 
(1974) information criterion: s 
AICg = —2 -lr +À (4-|Z]) 


where the associated maximized log-likelihood of 7 is 


lp = } f; (10) 


ET 


A = 2 is the penalty parameter for tree complexity; and IT | denotes the number of terminal nodes in 
T, with T being the set of all terminal nodes in 7 and |- | for cardinality when the argument is a set. 
Note that each added terminal introduces four more new parameters {1+, Hz0, Ut1, Or}. Thus the total 
number of parameters in tree T is 4-|Z|. A tree with a smaller AIC is preferable. Alternatively, the 
Bayesian information criterion (BIC; Schwarz 1978) with A = In(n) is another choice in common 
use. At each step, the algorithm examines all available internal nodes or links in the present tree and 
truncates the link that results in the subtree with the smallest AIC. The pruning procedure yields a 
nested sequence of subtrees Jo > Ti > ---Zy, where Jy is the null tree structure with root node 
only and “>” is read as “has subtree”. 

The final step of tree size selection entails identifying the optimal tree Z from the subtree 
sequence. The same AIC or BIC measure can be used for this purpose. However, cross validation 
is needed to validate / in Equation (10), which can be done via either the test sample method or 
resampling methods (V-fold cross-validation or bootstrapping), depending on the available sample 
size. Again, we refer readers to Su, Wang, and Fan (2004) for details. 

Remark Using the parametric approach in Section 2.3, an alternative splitting statistic can be 
obtained by maximizing the between-node difference. To split node 7q, let Z, denote the indicator 
function corresponding to a permissible split s of t. Consider model 


Pr(T = 1x) _ 
loge Tox) ~ Bo + Bits and 
y=wtnT +l +T -I +08 with £ ~ N(0,1). aD 


In view of Proposition 4, the Wald test statistic for testing Ho : Bı = Y3 = 0 can be used as the splitting 
statistic. Since the log-likelihood function is separable for B and y as shown in (7), cov(B,¥) = 0. 
After some algebraic simplification, the Wald test statistic is given by 


-1 2 — = = — 2 
( 1 1 a 1 | 1 ) Ca q Oui = Fe.0) = Fret Fro) } 


T A 
No Nel MgO Magl Nr; 1 NtR0 ô? 








where ô? = {yey — (n020 +n + Nrp0Vz00 + epi Vz—1) } /n is the MLE of o? in model 


(11). 


3.2 Aggregated Grouping 


Despite easy interpretability, one single tree model is notoriously unstable in the sense that a minor 
perturbation of the data could result in substantial changes in the final tree structure. In order to 
get around this problem, we propose an aggregated grouping method to integrate the stratification 
results from a number of competitive tree models. The key idea of this method is to obtain an n x n 
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distance or dissimilarity matrix D with entries that measure how likely each pair of observations is 
assigned to different strata. Cluster analysis can then be applied for final grouping. 

The procedure is described as follows. Let £ denote the whole data set. At each iteration b for 
b =1,...,B, generate bootstrap sample £ from £. Divide £) into two parts at random with a 
ratio of 2:1, the learning sample L” and the test sample Le), Using ie a large initial CIT Ge) 
is grown and pruned. With the aid of the test sample Le), a best-sized tree qe) is selected. Let 


(b) 


Kp = IZ] be the number of terminal nodes in Ze“. Then we apply qe) to the whole data £ so 


that each observation in £ falls into one and only one terminal node of qT). Next, we define an 
n x n pairwise distance matrix D, = {dj} such that 


Hs 0 if observations {i,i’} fall into the same terminal node of qo). 
ý 1 otherwise, 


for i,i’ = 1,...n. To compute D,, first obtain an n x Kp matrix Z, = (zip) such that 


1 if observation i falls into the k-th terminal node, 
=S o ; (12) 
otherwise, 
fori = 1,...,n and k = 1,...,K,. It follows that 
D, = ZZ}. (13) 


Next, the distance matrices are integrated by averaging over B iterations: D = s D;/B. It can 
be seen that the entries in D satisfy the triangle inequality and other properties that are required for 
being a legitimate distance measures. Finally, we can apply a clustering algorithm on D in order 
to obtain the final data stratification. The number of clusters K can be either determined by the 
clustering algorithm itself or preset as the mode of Kp’s. Other techniques for exploring distance or 
proximity matrices can also be applied, such as multidimensional scaling (MDS; Torgerson 1958). 
The whole procedure is outlined in Algorithm 1. 


Algorithm 1 Pseudo-Codes for Aggregated Grouping 

Set B +— number of repetitions. 

for b= 1 to B do 
— Generate bootstrap sample LD, 
— Randomly divide data £ into pere with a ratio of 2:1. 
— Grow a large CIT qe) using ra and prune. 
— Select the best tree Z”? using i. Let K, = bag le 
— Apply ge) to data £ and compute distance matrix D; = (dj) such that diy = 1 





if observation pair {i,i’} falls into different nodes of Ti and 0 otherwise. 
end for 
Obtain D + 1/B-Y?_, Dy; 
Obtain K + mode{K, : b = 1,...,B}. 
Apply a clustering algorithm on D with K clusters. 





We also suggest an optional alternative for computing the distance matrix D,, which is moti- 
vated by the amalgamation or node merging idea of Ciampi et al. (1988). It is common that non- 
neighboring terminal nodes in a final tree structure do not show much differences from each other. 
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This is because similar patterns in treatment assignment and effect may occur to sub-populations 
with different characteristics. By taking this issue into consideration of the distance matrix D, in 
Algorithm 1, a more effective way of grouping data may be achieved. 

To do so, we first obtain a Kp x Kp, pairwise distance matrix K, = {x} for all the terminal nodes 
in Ze) the best-sized tree obtained in the b-th iteration. Here, K = «(t,t’) > 0 denotes the distance 
between two terminal nodes T, T € GZ) which can be defined as the logworth (i.e., the negative 
logarithm with base 10) of the p-value obtained from a likelihood ratio test in (9) that compares T 
with T’. That is, 


x(t, T’) = —log,9(p-value). 


The likelihood ratio test can be conducted using all data in £. The smaller the p-value, the larger 
the difference between T and T is, as reflected by a larger value of «(t,t’). Elements in matrix D, 
are then defined by 


din = «(t(i),t(i’)), 


where t(i) denotes the terminal node into which the i-th observation falls. In matrix form, D, can 
be computed as 


D; = Z)KpZi,, (14) 


where Z, is given by (12). The D; in (13) can be viewed as a special case of (14) with K, = Ip. 

With modified D, in (14), there are two immediate consequences: first, the distances dj in D 
may not necessarily satisfy the triangle inequality; secondly, the number of final clusters K can no 
longer be suggested by the best tree sizes. Instead, it has to be determined by the clustering algo- 
rithm itself. Recent work on automatic determination of the optimal number of clusters is exempli- 
fied by Tibshirani, Walther, and Hastie (2001) and Wang (2010). Both methods are computationally 
demanding. 

Compared to one single CIT, the aggregated grouping produces a more accurate and stable 
grouping of data. Its results can help evaluate the instability of CIT. However, one drawback is loss 
of interpretability of the stratification rules. 


3.3 Summarizing Strata and ACE Estimation 


To summarize the final K strata obtained from either one single CIT or the aggregated grouping 
method, estimated propensity rate é, and the treatment effect A, can be obtained for each stratum. 
Such information helps delineate the heterogeneity structures in both assignment mechanisms and 
effects of the treatment. Strata with extremely low or high propensities may be excluded from 
causal inference due to lack of comparison basis. One may take a liberal approach when inspecting 
differential causal effects across K strata. The use of ACE to summarize treatment effects can be 
tentatively justified unless strong evidence for qualitative interaction exists. This is similar to the 
common practice in multi-center trials. While the quantitative treatment-by-center interaction is 
commonly seen, the overall efficacy of an investigational drug can still be established as long as 
there is no significant directional change in the comparison. An estimate of ACE, Ais given by 


(Fer — Yro) (15) 


>) 
II 
TMa 
JE 
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with sampling variance approximated by 


K ,2 2 2 
pa (243), (16) 


Nki Nko 


where (Fer 521) are the sample mean and variance of observed Y’s in the treated group of the k-th 
stratum and similar definitions apply to other quantities. Additional covariate adjustment within 
each terminal node can be made and alternative stratification estimates of ACE are available, as 
summarized and discussed in Lunceford and Davidian (2004). 

Propensity score stratification or subclassification seeks subpopulations of form Q; = {0 : 
e(X) = e}, in which homogeneity of treatment effects, however, can not be guaranteed. Direct 
comparison of the mean responses could give a distorted estimate of the causal effect in Q,. Com- 
paratively, CIT and aggregating grouping offer refined stratification so that the causal effect within 
each resultant stratum Qa can be correctly captured, which consequently offers improved estima- 
tion of ACE. Alternatively, one may try to correct the problem with propensity score stratification 
by applying additional ANCOVA-typed adjustment within each stratum. It is important to note 
that ANCOVA does no help with this correction, unless effect modification is incorporated into the 
model by allowing for treatment-by-covariate interaction terms. This approach would consist of 
two steps. In the first step, a number of strata are obtained by stratifying propensity scores. In the 
second step, an extended ANCOVA model that allows for interactions is fit within each stratum. We 
may adopt an approach explained by Aiken and West (1991) in order to make the overall causal 
effect in Qe appear as a regression coefficient. This approach fits a linear model of form 


E(¥;|T;,X;) = Bo +8. T; + B'xi +T; Yxi. (17) 


where x’ = x; — E(X|Q,) for i € Q, denotes the centered covariate vector. Then the parameter 6, 
in (17) coincides with the overall causal effect in Q.. Finally, the ACE is estimated by combining 
8.’s via (15). The CIT stratification roughly resembles this two-step approach described above, yet 
with additional advantages. First, the facilitating score offers a unified setting where these two steps 
are naturally combined. Secondly, how to specify interaction terms in (17) remains a dazzling task, 
which, however, can be efficiently handled with recursive partitioning in CIT. 


3.4 Predicting Individual Causal Effects (ICE) 


With the advent of research with biobanks, molecular profiling technologies have been greatly ad- 
vanced to allow for collection of a patient’s proteomic, genetic, and metabolic information. Given 
various information collected on a patient, how to customize treatments to the individual’s best 
needs has posed great challenges to players in the field of personalized medicine, including statisti- 
cians. A fine delineation of treatment effects plays a critical role in such endeavors. 

For this purpose, we define “individual causal effect” (ICE) as a conditional expectation E (Y; — 
Yo|x), given a subject with X = x. ICE is conceptually different from the unit level causal effect 
Yı (©) — Yo(w). Strictly speaking, ICE makes conditional causal inference at the subpopulation level 
{@: X(@) =x}. On the other hand, ICE is the best that one could practically do with available infor- 
mation in order to approximate the unit level causal effect. Especially when X is high-dimensional 
and has many continuous components, it is likely that each value x corresponds uniquely to unit œ 
with X(@) = x. In what follows, we devise a powerful method via ensemble CITs to predict ICE by 
borrowing ideas from random forests (Breiman, 2001). 
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To proceed, we first randomly divide data £ into V folds. To ensure similar proportions of 
individuals in the treatment groups across all folds, stratified sampling with stratification on T can 
be used. Let £, denote the v-th fold and Ly =L-L, for the remaining data. 


Algorithm 2 Pseudo-Codes for Predicting Personal Causal Effects (ICE) 
Set V, B, and m. 
Randomly split data £ into V sets {L;,..., Ly}, with stratification on T. 
for v = 1 to V do 
Set Ly = L—Ly. 
for b = 1 to B do 
— Generate bootstrap sample a from Ly). 





— Grow a CIT a using rae without pruning. At each split, only m randomly selected 
variables are used. x 
— Estimate causal effects Â, and propensity é, for each T € T based on L(y). 


— Apply a to data 4. 


— Compute A) and a) for each i € LANT, via A, and é,. 

end for 

Obtain {A;, é;} as averages of (A, eh) :b=1,...,B}, for each i € 4. 
end for 
Merge estimated {Â;, é;} into data £ using ID key. 


return L. 





We draw B bootstrap samples from L). With each bootstrap sample Te grow a moderately- 


sized CIT a without pruning. When constructing Le we adapt the approach in random forests 
(Breiman, 2001), where only m randomly selected variables and their associated cutoff points are 
evaluated at each split. This tactic helps improve the predictive performance by de-correlating the 
tree models in the random forests. For each terminal node T € ls estimates of the causal effect 
and propensity, 


A, = Yr — yro and ê =mi /[m, 


are computed using data in £;,). Then we apply T to Liy) and predict the ICE A?) and propensity 


score a?) for each individual i € 4. Specifically, 
A? =A, and a) = ĉr, 


if the i-th individual falls into the terminal node t. The final predicted ICE and propensity for the i 
individual are 


—_ 


Me» 


A 


3 ieee 
Â=- F A and G = —y 2. 
B; 1 Bix 


Their standard errors can also be obtained from the bootstrap repetitions. 

The same procedure is repeated for each fold to estimate ICE and propensity scores for all 
individuals in £. The whole method is described in Algorithm 2. Further exploration can be done 
with the estimated ICE and propensity scores and some illustrations are given in Section 5. While 
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we have used a V-fold cross-validation approach in Algorithm 2, the method can be directly applied 
to an independent future sample for predicting ICE. Other features in random forests such as variable 
importance ranking and partial dependence plots could also be adopted for causal inference. 

Some alternative ways of predicting ICE are discussed below. First of all, the standard method 
for modeling treatment-by-covariates interaction in many application fields is to fit a linear model 
with first-order cross-product terms, that is, 

y = Bot BiT+P5x+T -Byxt+e 
= Bo+ Box + (Bi + B3x)-T +e. (18) 
The ICE is formulated as (Bı + 83x), which is also linear in x. While this parametric approach is 
readily available, it relies heavily on linearity and is subject to a greater risk of model misspecifica- 
tion. 

Another convenient way for predicting ICE is to use the ‘regression estimation’ approach, as 
described in Schafer and Kang (2008). In this approach, we separately fit a predictive model (pos- 
sibly using machine learning techniques) for Y, using data in the treated group only and a predic- 
tive model for Yo using data in the untreated group only. Then we apply these models to obtain 
predicted values (;ı, i;o) for the potential outcomes for every subject in the data. ICE can be esti- 
mated as Â; = jı — jo. Alternatively, the observed response can be used in the calculation so that 
Â; = y; — io for the treated group and Â; = fj — yi for the untreated group. Note that this regression 
estimation method solely involves the outcome models. The underlying rationale is based on the 
fact that E (Y, |X = x) = E(Y|X = x,T = t) for t = 0,1, given strong ignorability and other condi- 
tions. However, the prediction across treatment groups heavily involves extrapolation, again due to 
the imbalance in covariates. When used for ACE estimation, Schafer and Kang (2008) found that 
it is not among the top performers, but may be possibly improved by incorporating the propensity 
score into the model. 

Estimating ICE also emerges as one intermediate step in some ACE inference procedures in- 
cluding structural nested models introduced by Robins (1989), marginal structural models (see, 
e.g., Robins 1999), and the targeted maximum likelihood method (see, e.g., van der Laan and Rubin 
2006). These procedures are particularly advantageous in dealing with longitudinal observational 
data where both treatment and covariates are time-varying, but they are also applicable to cross- 
sectional or ‘point treatment’ data. Two estimation methods are commonly used in these proce- 
dures: the g-computation and the inverse probability of treatment weighting (IPTW). Model (18) 
is often embedded in either method, for handling effect moderators in IPTW or being used as the 
Q-model in g-computation (see, e.g., Snowden, Rose, and Mortimer 2011) or targeted maximum 
likelihood (see, e.g., Rosenblum and van der Laan 2011) to model and predict potential outcomes. 
With g-computation, it is clear that other semiparametric or nonparametric data adaptive methods 
(as in ‘the ‘regression estimation’ approach) can be flexibly used for predicting potential outcomes 
for each observation under each possible treatment regimen. See van der Laan, Polley, and Hubbard 
(2007) and Austin (2012) for examples. 

Yet another method for estimate ICE E(Y; — Yo|X = x) directly is to relax X = x to a neigh- 
borhood of x, N(x). Such a neighborhood of x can be facilitated using either K-nearest neighbor 
(KNN) or, more generally, kernel smoothing. If KNN is used, let A(x) denote the corresponding 
neighborhood of x. A natural estimate of ICE is given by 


Li: wen VT Lis xeral — Fi) 
Yi wiene(x) Ti Ei xel l T) 
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This KNN approach assigns weight 1 to K observations within A(x) and weight 0 to others. More 
generally, we may use kernel smoothing to have weights depending on || x; — x || for all data points. 
To make it more robust to non-random treatment assignment mechanism, it might be possible to 
incorporate propensity score into the weights as well. While this implementation has not been seen 
in the literature, it has some promising potentials for its nonparametric nature and deserves further 
research. On the other hand, a neighborhood defined with high-dimensional data could have poor 
performance and the computation could be demanding. In addition, interpretation with respect to 
covariates becomes obscure with nearest neighbor approaches. 

Comparatively, the essential ingredient in our ensemble CIT approach is stratified causal esti- 
mates within subpopuations {x : a(x) = a}, which is intermediary in-between ACE and ICE. We 
have the convenience to either move forward for ICE with ensemble models or move backward for 
ACE by integrating stratified results. It is natural to use tree methods for extracting strata. Tree- 
structured methods are nonparametric in nature and hence more robust to model misspecification. 
Recursive partitioning excels in efficiently handling interactions and categorical variables and pro- 
vides meaningful interpretations. Besides, ensemble models usually performs better in predictive 
modeling. With that being said, a comprehensive comparison study of these alternative approaches 
in predicting ICE would be interesting for future research. 


3.5 Consistency 


In terms of asymptotic properties of recursive partitioning based estimators, Breiman et al. (1984) 
provided detailed developments of convergence in rth mean and uniform convergence on compacts. 
Gordon and Olshen (1984) established the almost sure convergence under certain constraints. In 
this section, consistency of the CIT based causal effect estimator is provided in the light of Breiman 
et al. (1984). 

Let the predictor space X € R? be Euclidean. A tree structure T partitions X into a number of 
disjoint sets or terminal nodes {t: T € T }. Again, t(x) denotes the terminal node where x falls into. 
Let d(-) be the diameter of a set, that is, d,(t) = sup, wer || x — x’ ||, where || - || is the Euclidean 
norm. With the observed data of size n, let k, be nonnegative constants such that, with probability 
one, P 

n: > knlogn foranyt€ J, 


where, same as before, nz; is used to denote the number of subjects in node qt that are assigned 
to the treated group, that is, nz) = Yjc,7j, and nw for the control group. Suppose that a(-) is a 
weak facilitating score and a, = )}j-,a(x;) denotes its mean vector in node T. Let (Y1, Yo, Y,7,x) € 
T represent a new observation that is independent of current data {(y;,7;,x;) : i = 1,...,n}. The 
following theorem establishes the mean square risk consistency for (Jz; — Jzo), the causal effect 
estimate based on direct comparison of sample means in the terminal node T = T(x). 


Theorem 5 Suppose that 
max {E|¥,|°*®, E|Yo| E} < M < for some £ > 0 and M > 0, (19) 


0 < e(x) < 1, and treatment assignment is strongly ignorable. Assume that E (Y; |a) and E(Yo|a) are 
continuous in a and a(x) is continuous in x. Further assume that 


lim k, = œ. (20) 


n—yoo 
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and 
lim d,(t) =0 (21) 
in probability. Then 
lim E |(¥e1 — Feo) — E{¥i — Yo (a(x) = a,}|? =0. (22) 


The results in Theorem 5 can be improved to L, convergence for any r > 1 if we change the as- 
sumption (19) to 
E|Y,|\"t® <M <œ and E|Y|"** <M < o. 


This can be immediately seen from the proof provided in Appendix C, where all the arguments 
we have used hold in L, spaces. Toth and Eltinge (2010) has recently proved asymptotic design 
L2 consistency of tree-based estimator when applied to complex survey data, following similar 
arguments in Gordon and Olshen (1978, 1980). Itis worth noting that the Horvitz-Thompson ( 1952) 
typed estimator via inverse probability weighting has fundamental use in both causal inference with 
observational data and in estimation the superpopulation mean with stratified survey data. 

These convergence results for recursive partitioning are obtained without dependence on the 
specifics of the algorithm. Unfortunately, no theoretical justifications have been obtained so far 
for the splitting rules and pruning algorithms (p. 327; Breiman et al. 1984). Moreover, one of key 
assumptions for consistency requires that the mesh size of T goes to 0 when the sample size gets 
large, as implied by assumption (21). This is an unappealing constrain to practical applications. 


4. Simulated Experiments 


In this section, simulation experiments are performed to first understand and assess CIT and make 
comparisons with other methods and then investigate how CIT performs under misspecification. 


4.1 Performance of CIT 


In terms of applications of tree methods relevant to treatment effect assessment, there have been two 
major developments serving different purposes: 1) propensity trees (PT) that estimate the propen- 
sity score e(X), as studied by McCaffrey, Ridgeway, and Morral (2004) and Lee, Lessler, and Stuart 
(2010); and 2) interaction trees (IT) for subgroup analysis (Su et al., 2009). An interaction tree ex- 
plicitly models the treatment-by-covariates interactions for detecting differential treatment effects. 
However, this method was developed for experimental data and does not take the non-randomized 
treatment assignment into account. As we shall demonstrate, failure or inadequacy to account for 
propensity information may lead to misleading interaction results, in that the superficial difference 
in treatment effects might have been caused merely by heterogenous treatment selection mecha- 
nisms. 
We generate data with the following steps. 


1. Generate X,...X5 independently from Unif(0,1) and create threshold variables Z; = 1¢y,<0,5} 
for j= 1,...,5. 


2. Set logit(a) = ap +.a1Z) + a2Zp with logit(m) = log{m/(1 — T) }. Generate T ~ Bernoulli(z). 





3. Set u = bo +bıT + b222 + b3Z3 + b4Z4 + b5T -Z4 and generate Y ~ N (u, 0?) with o = 1. 
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In addition to the response variable Y and treatment indicator T, each data set involves five covari- 
ates. In the above simulation strategy, covariate X; is an exposure or treatment predictor involved 
in the propensity model only, X2 is a confounder that relates to both T and Y, X3 is a response pre- 
dictor or prognostic factor, X4 is an effect-modifier, and X5 is a totally irrelevant covariate. All the 
covariate values are rounded at the second decimal place. 

By applying different values for the coefficients a;, i=0,1,2, and b;, j =0,...5, we can obtain 
different model configurations, for example, containing either interaction or confounding terms, 
both, or neither. We can also investigate how these tree methods handle covariates that play different 
types of roles in the causal pathway between T and Y. Specifically we consider the following five 
model configurations: 


l. 
° 


l. 


Model A. a= {a;}= (2,0,0), b= {b;} = (2,2,0,0,0,0) 
ModelB. a= {aj} = (2,2,—4)', b= {bj} = (2,2,2,2,2,2)'; 
ModelC. a= {aj} = (2,0,—4)’, b= {bj} = (2,2,2,0,2,2)’; 
Model D. a= {aj} = (2,2,—4), b= {bj} = (2,2,2,0,0,0)’; 

( ) ( ) 


Model E. a= {aj} =(2,2,—4)', b= {bj} = (2,2,0,0,2,2). 


Model A is a null model, where the covariates have no influence on the treatment effect. This model 
helps investigate the size issue or the type I error rate. Model B is equipped with all structures. 
Nevertheless, a massive tree with 16 terminal nodes is needed in order to fully represent the model 
structure. Model C also contains both confounding effect of Xz and interacting effect of X4, while 
neither X; nor X3 is involved. In this case, a tree with four terminal nodes is expected. Model D 
mainly involves the confounder X72, plus the exposure predictor X;. Lastly, the active components 
in Model E are the effect modifier X4 and the prognostic factor X3. 

For each simulated data set, all three tree methods, CIT, IT, and PT, are applied. Only one 
sample size is reported and the test sample method is used to select the optimal tree size, with 600 
observations for the training sample and 400 observations for the test sample. Both AIC and BIC 
are used for the tree model selection. For each final tree selected, we record the optimal tree size and 
the splitting variables involved in the final tree structure. Table 1 presents the summarized results 
over 200 simulation runs. 

We first examine the results from the null Model A. When BIC is used, all three tree methods 
seem rather conservative in committing Type I errors, implying that unsolicited signals are unlikely 
to be identified. With AIC, the empirical size, that is, the rate of giving false tree signals, is (100 — 
90.5)% = 9.5% for CIT, (100 — 88.5)% = 11.4% for IT, and (100 — 98.5)% = 1.5% for PT. 

Next, Model B contains all the components that are related to the treatment and the response. 
Experimenting with this model provides an overall picture of what patterns each tree method tends 
to recognize. It can be seen that CIT yields the largest tree models by mostly catching the effects of 
X2, X3, and X4. The treatment predictor X; is completely missed out by BIC and occasionally (32% 
of the time) selected by AIC. Note that X; is neither a confounder nor a modifier to the treatment 
effect. Due to the smaller penalty for mode complexity, AIC tends to select larger trees than BIC. 
As expected, the final propensity trees are split by both X; and X2. The average final tree size of IT 
is 2.92, compared to its expected value 2. It is interesting to note that IT frequently gets confused 
by the confounding effect of X2. 

Model C contains only the components that actively influence the causal effects, namely, the 
confounder X, and the effect-modifier X4. Both are perfectly identified by CIT. PT performs well 
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Selection Final Tree Size Splitting Variables 
Model Method Criterion 1 2 3 4 5 6 >7 Xı Xo X3 X4 X5 
A CIT AIC 905 55 15 15 10 00 00 45 20 25 20 2.0 
BIC 100.0 00 00 00 00 00 00 00 00 00 00 0.0 
IT AIC 885 60 45 10 00 00 00 25 40 35 3.0 3.0 
BIC 100.0 00 00 00 00 00 00 00 00 00 00 0.0 
PT AIC 985 05 10 00 00 00 00 05 05 05 1.0 0.0 
BIC 100.0 00 00 00 00 00 00 00 00 00 00 0.0 
B CIT AIC 0.0 40 00 195 5.5 39.5 31.5 32.0 96.0 96.0 100.0 1.5 
BIC 0.0 40 0.0 260 0.0 61.5 8.5 0.0 96.0 96.0 100.0 0.0 
IT AIC 0.0 65 25.0 505 13.0 3.0 20 7.0 935 60 100.0 8.0 
BIC 0.0 405 28.5 27.0 35 0.5 0.0 10 595 0.5 100.0 1.0 
PT AIC 0.0 0.5 335 615 40 05 00 995 100.0 05 2.0 1.5 
BIC 0.0 10 515 465 10 00 00 99.0 100.0 00 0.5 0.0 
C CIT AIC 0.0 00 45 905 5.0 0.0 0.0 1.0 100.0 2.0 100.0 0.5 
BIC 0.0 00 45 955 0.0 00 00 0.0 100.0 0.0 100.0 0.0 
IT AIC 0.0 47.0 35.5 140 15 15 0.5 0.5 52.5 0.5 100.0 2.5 
BIC 0.0 55.5 33.0 100.0 1.0 05 00 0.0 44.5 0.0 100.0 0.0 
PT AIC 0.0 97.5 10 15 00 0.0 0.0 1.0 100.0 10 0.5 0.5 
BIC 0.0 100.0 0.0 00 00 00 00 0.0 100.0 00 0.0 0.0 
D CIT AIC 0.0 10 83.5 11.0 20 15 10 990 1000 20 25 4.5 
BIC 0.0 105 895 0.0 0.0 0.0 00 895 100.0 00 0.0 0.0 
IT AIC 1.5 43.0 31.5 180 45 10 0.5 5.5 985 30 45 2.0 
BIC 0.2 545 28.0 140 15 0.0 0.0 10 980 05 1.5 0.0 
PT AIC 0.0 0.0 33.5 620 25 15 05 100.0 1000 20 25 1.0 
BIC 0.0 10 495 490 05 0.0 00 99.0 100.0 00 0.0 0.0 
E CIT AIC 0.0 00 25 890 85 0.0 0.0 15 3.0 100.0 100.0 2.0 
BIC 0.0 00 25 975 0.0 00 00 0.0 0.0 100.0 100.0 0.0 
IT AIC 0.0 900 90 10 00 00 00 25 15 2.5 100.0 3.5 
BIC 1.5 975 10 00 00 00 00 05 00 0.0 98.5 0.5 
PT AIC 965 35 0.0 0.0 0.0 0.0 0.0 15 05 10 05 0.0 


BIC 985 15 00 00 00 0.0 0.0 10 05 00 0.0 0.0 





Table 1: Simulation Results Based on the Test Sample Method: Relative frequencies (in percent- 
ages) of the final tree sizes in 200 runs identified by the causal inference tree (CIT), inter- 
action tree (IT), and propensity tree (PT). Only one set of sample sizes is reported, with 
600 observations forming the learning sample and 400 observations for the test sample. 


in identifying the confounder X2 while IT succeeds in recognizing the effect-modifier X4. The same 
interesting phenomenon as with Model B occurs again: IT wrongly selects Xz quite often. This will 
further be elaborated in Model D. 


Model D is basically a propensity model, involving both the exposure predictor X; and the 
confounder X only. In this case, CIT and PT provide equivalent results. Aiming at differential 
treatment effects, IT is supposed to have a null tree structure. However, we can see that most of 
time IT ends up with one or more splits on X2. To gain insight, Figure 1 plots the splitting statistic 
used in both IT and CIT versus each cutoff point for X2 in a single split of the data. The splitting 
statistic used in IT is a squared ż test statistic for interaction; thus the best cutoff point corresponds 
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Figure 1: Plot of splitting statistic versus cutoff point on the confounder X2: (a) ¢ test statistic 
(squared) for interaction used in IT; (b) likelihood ratio test statistic (up to some constant) 
used in CIT. Data were generated from Model D in Section 4. 


to the maximum of splitting statistics. It is interesting to note in Figure 1(a) that the splitting 
statistic actually reaches its minimum at X? = 0.5, the only place where the treatment comparison 
is unbiased. At other cutoff points, the splitting statistic as a measure of interaction misleadingly 
inflates due to lack of adjustment for propensity. On the contrary, this does not cause a problem for 
CIT, which correctly selects the right cutoff point 0.5 as shown in Figure 1(b). Therefore, in order 
to identify differential causal effects correctly, it is critical to take confounders into consideration; 
otherwise, the estimation bias owing to imbalance of confounders between treatment groups may 
become overwhelming and eventually lead to misleading conclusions about the differential causal 
effects. 


Finally, Model E is essentially an outcome regression model, in which both the prognostic 
factor X3 and the effect-modifier X4 are involved. It can be seen that CIT functions similarly to IT in 


2975 


SU, KANG, FAN, LEVINE AND YAN 














Effect Propensity Case I Case II Case III Case IV 

Group Ax ek Ak êk Ak êk Ak êk Ak ex 
1 -—1.940 0.156 —2.077 0.189 —3.246 0.502 —2.005 0.171 —2.106 0.201 
2 2.067 0.866 1.924 0.861 1.925 0.861 —0.098 0.857 1.916 0.860 
3 —1.938 0.843 —1.987 0.829 —2.629 0.676 —1.016 0.840 —2.006 0.824 





Table 2: Simulation Results for Assessing Sensitivity of CIT to Misspecification. Four scenarios 
are considered. In Case I, variables {X1 ,X2,X3,X4} are used; In Case II, the confounder 
Xə is omitted; In Case II, the effect-modifier X3 is omitted; In Case IV, the collider X5 
is included. The estimated treatment effect and propensity for each group were averaged 
over 100 runs. 


detecting treatment-by-covariate interactions. CIT also identifies splits on the prognostic factor X3. 
It comes as no surprise that PT, concerning propensity only, gives a null tree for most of the time. 


4.2 Sensitivity under Misspecification 


We next investigate how CIT performs under misspecified scenarios where an important confounder 
or effect-modifier is omitted or when a collider is included. We design an experiment with the 
following data generation scheme: 


1. Generate X),...,X4 independently from Unif(0,1) and create threshold variables Z; = 1 {x,<0.5} 
for j=1,...,4. 


2. Generate W; and W) independently from Bernoulli(0.5) and hence simulate X5 ~ A((2W; + 
2W», 1). 


3. Set logit(z) = 0.5 — Zı Z2 + W;. Generate T ~ Bernoulli(z). 
4. Set u=2+2Zı Z2 —2T +4Z, Z3T +W and generate y ~ N(u, 1). 


The observed data consist of repetitions of {Y,X,,...,X4}. With the above configuration, X; is both 
a confounder and an effect-moderator; X> is a confounder; X3 is a moderator; X4 is irrelevant; and 
Xs is a collider with the M diagram model (see, e.g., Figure 2(a) in Greenland 2003). The data 
essentially involve three groups with either different propensities or treatment effects. Observations 
in Group 1 satisfies Z1Z2 = 1; Group 2 is characterized by (1 — Z;)Z3 = 1; and Group 3 contains the 
others. 

In order to assess sensitivity, an independent validation set with 5,000 observations is first gen- 
erated. Based on true grouping, the causal effect and propensity for each group are computed and 
presented in Table 2. Next, a total of 100 simulation runs are considered. For each simulation run, a 
training set with 600 observations and a test set with 400 observations are generated, on which basis 
CITs are constructed using different sets of variables. In Case I, variables {X1 ,X2,X3,X4} are used; 
Case II uses {X1 ,X3,X4} with confounder Xz omitted; In Case III, {X1 , X2,X4} are used by omitting 
the moderator X3; In Case IV, {X),X2,X3,X4,Xs5} are used by including the collider X5. Each final 
CIT (based on BIC) is applied to the validation set to compute the individual causal effect Ai and 
propensity ê; for each observation in the validation set. The predicted ICEs and propensities are 
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aggregated for each group, based on the true grouping. The grouped causal effect and propensity 
estimates are then averaged over 100 simulation runs. The results are also presented in Table 2. It 
can be seen that, in Case I, CIT does very well in estimating treatment effects and propensities. In 
both Case II and Case III, substantial bias is present in estimating the treatment effects. The results 
for Case IV suggest that the collider X5 also introduces bias. However, compared to the bias from 
omitting a confounder or moderator, the bias from including a collider is much smaller. This is 
consistent with the conclusions in Greenland (2003). 


5. Analysis of NSW Data 


As an illustration, we revisit the NSW data set extensively analyzed by LaLonde (1986) and Dehejia 
and Wahba (1999), where the objective is to assess the impact of the National Supported Work 
(NSW) Demonstration on post-intervention income levels. The NSW demonstration was a labor 
training program implemented in the mid-1970s to provide work experiences for a period of 6- 
18 months to individuals facing economic and social difficulties. NSW itself was designed as a 
randomized controlled study where subjects were randomly assigned to two treatment groups: the 
NSW-exposed group and the unexposed group. 

With a rather innovative approach that later on became influential, LaLonde (1986) compiled 
a composite data set by taking subjects in the NSW-exposed group only and then obtaining the 
nonexperimental control group from other sources, including the Panel Study of Income Dynamics 
(PSID) and the Current Population Survey (CPS) databases. His aim was to examine the extent to 
which nonexperimental estimators can replicate the unbiased experimental estimate of the treatment 
impact. He concluded that nonexperimental estimators are either inaccurate relative to the experi- 
mental benchmark or sensitive to model specification. Since then, the mixed NSW data have been 
analyzed by various authors with alternative approaches. Among others, Dehejia and Wahba (1999) 
obtained estimates of the treatment effect that are close to the experimental benchmark estimate or 
the ‘gold’ standard using propensity score matching and stratification. 

Most of these previous works are focused on estimating the ACE of NSW. Here we shall apply 
the CIT methods to explore the variabilities of its effects, in addition to dealing with the nonrandom 
treatment assignments. There are several versions of the data with varying sources for obtaining the 
control or unexposed group, available from http: //www.nber.org/~rdehejia/nswdata.html. 
The data set we use is the one available in the R package MatchIt contributed by Ho et al. (2007, 
2011). This is a subset restricted to males who had 1974 earnings available, for the reasons explained 
in Dehejia and Wahba (1999). There are 614 observations (185 treated and 429 control) and 10 
variables in the data, which include the treatment assignment indicator. A brief description and 
some summary statistics of these variables are provided in Table 3. The outcome variable is re78, 
the 1978 earnings. All covariates but educ are severely unbalanced between the participants actively 
exposed to NSW and those in the unexposed group selected from other survey databases. 
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(a) Propensity Tree (b) Interaction Tree 
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Figure 2: Final Tree Models for the NSW Data: (a) Propensity Tree; (b) Interaction Tree; (c) Causal 
Inference Tree. 
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(a) Continuous Variables 

















Variable All NSW Exposed Unexposed P-value 
Name Description mean sd mean sd mean sd two-sample t Wilcoxon 
age Age in years 27.36 9.88 25.82 7.16 28.03 10.79 0.0107 0.5195 
educ Schooling years 10.27 2.63 10.35 2.01 10.24 2.86 0.6330 0.7920 
re74 1974 earnings 4,557.55 6,477.96 2,095.57 4,886.62 5,619.24 6,788.75 0.0000 0.0000 
re75 1975 earnings 2,184.94 3,295.68 1,532.06 3,219.25 2,466.48 3,292.00 0.0012 0.0000 
re78 1978 earnings 6,792.83 7,470.73 6,349.14 7,867.40 6,984.17 7,294.16 0.3342 0.2818 
(b) Discrete Variables 
Variable Frequency P-value 
Name Description NSW Exposed Unexposed x? Fisher’s Exact 
black 0-No 29 342 0.0000 0.0000 
1 - African-American 156 87 
hispan 0-No 174 368 0.0053 0.0026 
1 - of Hispanic origin 11 61 
married 0-No 150 209 0.0000 0.0000 
1 - Yes 35 220 
nodegree 0-No 54 173 0.0113 0.0106 
1 - Has a high school degree. 131 256 




















Table 3: Variable description and summary statistics for the NSW data set. All earnings are expressed in U.S. dollars. 
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(a) Propensity Tree 
NSW Group Unexposed Group Estimated 
Node size mean sd size mean sd Propensity 





I 20 8.1423 6.6646 75 5.2302 6.3981 21.05% 
H 9 6.0534 4.9218 267 8.1712 7.6170 3.26% 
MI 156 6.1363 8.1435 87 4.8534 6.2017 64.20% 





(b) Interaction Tree 














NSW Group Unexposed Group Treatment Effect 
Node size mean sd size mean sd estimate s.e. 
I 83 5.0392 5.1160 152 5.2804 5.5401 —0.2412 0.7192 
I 66 8.4894 10.3819 69 3.4528 5.8233 5.0366 1.4576 
I 36 5.4455 7.0965 208 9.4007 8.0201 —3.9552 1.3070 





(c) Causal Inference Tree 

















NSW Group Unexposed Group Estimated Treatment Effect 
Node size mean sd size mean sd Propensity estimate s.e. 
I 22 8.1431 6.3676 156 4.8438 5.6728 12.35% 3.2993 1.3118 
II 7 5.4539 5.3997 186 9.7759 8.0259 3.62% —4.3221 3.0634 
Il 71 4.6987 4.8043 55 4.8545 5.9303 56.35% —0.1558 0.9564 
IV 15 3.8662 3.9130 10 1.0999 2.8541 60.00% 2.7663 1.4438 
V 70 8.0809 10.7408 22 6.5570 7.3371 76.09% 1.5239 2.4565 





Table 4: Summary statistics for the terminal nodes: (a) the final propensity tree (PT); (b) the final 
interaction tree (IT); and (c) the final causal inference tree (CIT). The means and standard 
deviations are given in thousand dollars. 


We applied three tree procedures to the data: PT, IT, and CIT. The final tree structures, all 
selected by BIC, are plotted in Figure 2. Considering the moderate sample size, a bootstrap method 
was used for final tree selection. In Figure 2, the internal nodes are denoted by circles. The splitting 
rule is given under each internal node. Observations satisfying the rule go to the left child node 
and observations not satisfying go to the right child node. The terminal nodes are denoted by 
rectangles and renamed by Roman numerals inside. Underneath each terminal node is the number of 
exposed subjects versus the number of unexposed subjects within the terminal node. Some summary 
statistics for the terminal nodes in each final tree are provided in Table 4. 

Figure 2(a) gives the final PT structure, which delineates a meaningful heterogeneity in propen- 
sity. It is clear that African Americans were more likely to participate in this labor program. PT 
also identifies a group, terminal node II, with extremely low propensity (3.26%). This group is 
characterized by people who were not African Americans and had some income in 1974. However, 
this PT model tells nothing about differential treatment effects. 

Figure 2(b) displays the final IT structure. Variables re74 and age stand out as determinants 
of differential causal effects. Apparently remarkable differential treatment effects seem to exist 
across the three terminal nodes based on Table 4(b). However, since the method does not adjust 
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Figure 3: Aggregated Grouping for the NSW Data: (a) Multidimensional scaling (MDS) plot; (b) 
Dendrogram for hierarchical clustering with average linkage. The distance matrix was 
computed by aggregating 100 bootstrap samples. 
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for heterogeneous propensity, the results are not reliable. Hence we make no further attempt in 
interpreting. 

Figure 2(c) presents the final CIT model, which has a more comprehensive structure. It is 
interesting to see that the left-half of the tree resembles the PT tree in Figure 2(a). In particular, the 
CIT comes up with a similar terminal node II, which contains non-African Americans with income 
higher than $2,721 in 1974. Since CIT accounts for both propensity and differential causal effects, 
it is valid to estimate the NSW effect via direct comparison of sample means within each terminal 
node. Table 4(c) provides the relevant quantities. CIT also identifies some interesting patterns 
of differential treatment effects. The surprising comparison occurs to terminal node II, where the 
NSW-exposed group had a lower average income than the unexposed group with a mean difference 
of $4,322. However, this should not be a point of great concern due to its very low propensity 
3.62%. 
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Figure 4: Plot of the Estimated Personal Causal Effects vs. Propensity Scores for the NSW Data. 
Referring to Algorithm 2, B = 1000 bootstrap samples were used in a three-fold cross 
validation procedure and the parameter m was set as 3. 


If it is agreed that terminal node II be excluded from consideration due to lack of comparison 
basis and the minor negative effect of NSW in terminal node III be ignored, then one may tentatively 
conclude the absence of qualitative interactions. Using Equations (15)-(16) and information in Table 
4(c), the ACE is estimated as $1,845 + $809, which is very close to the benchmark randomized 
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experiment estimate of $1,794 + $633. As a comparison, the unadjusted estimate is $635 + $677 
and the ANCOVA estimate is $1,548 + $781 after adjusting for all covariates. It is worth mentioning 
that the ANCOVA estimate varies dramatically when different sets of variables are included in 
the model. Using nonparametric matching method (Ho et al., 2007) implemented in the MatchIt 
package, the subclassification estimate (with 5 subclasses) is $1,237 + $1,196 and the optimal 
matching (Rosenbaum, 1989) based estimate is $1,366 + $720. 

Next, we applied the aggregated grouping method described in Algorithm 1. We obtained an 
averaged distance matrix from 100 bootstrap samples. The modal number of optimal tree sizes 
is 5. The classical MDS (Gower, 1966) was used to explore the the distance matrix. Figure 3(a) 
provides the resultant plot when the data are represented in a two-dimensional space. Agglomerative 
hierarchical clustering with average linkage was then used to determine the final clusters. See Figure 
3(b) for the dendrogram. The cluster membership specification was also added to the MDS plot in 
Figure 3(a). It can be seen that Cluster 2 and Cluster 5 are distant from other three clusters. Table 
5(a) shows the correspondence between the five clusters and the five CIT terminal nodes. It can 
be seen that overall they match well, except for minor inconsistency between clusters 4 & 5 and 
terminal nodes IV & V. This indicates that the CIT structure is relatively stable. The summary 
statistics for the five clusters are outlined in Table 5(b), showing a pattern similar to Table 4(c). 
After removing Cluster 2, the estimate of ACE is $1,897 + $807. We would like to emphasize that 
the excluded Group II in CIT can be explained by the fact that people who were not black and had 
some income in 1974 seemed unlikely (with estimated propensity 3.62%) to participate the NSW 
intervention program. This easy interpretation is no longer available with Cluster 2 obtained from 
the aggregated grouping procedure. 

Finally, ensemble CITs were used to estimate the ICE and propensity score for each individual. 
Referring to Algorithm 2, three-fold (V = 3) cross-validation with B = 1,000 bootstrap samples 
(with stratification on treatment) was used in the analysis; and at each split, m = 3 variables were 
randomly selected as candidate splitting variables. Figure 4 plots the estimated ICE vs. propensity 
scores. The interpretation for ICE is the difference between what an individual would have earned 
in 1978 if he had attended NSW, compared to the 1978 earnings if he had not attended. It can be 
seen that the area with low propensity (below .10) is dominated by subjects in the control and their 
associated personal effects of NSW are quite mixed. Other than that, the intervention program seem 
to have an overall positive effect. Figure 5 summarizes the results for each treatment-by-stratum 
combination, in which the five strata obtained from aggregated grouping are used. It can be seen 
that both propensity and individual causal effects are reasonably homogeneous within each stratum, 
even though the individuals were from different treatment groups. 














6. Extension to Ordinal/Continuous Treatments 


The concept and properties of the facilitating score can be extended to scenarios where the treatment 
variable is nominal (Lechner, 1999) or ordinal (Imbens, 2000). Suppose that the treatment variable 
T is allowed to range within 3, where S is a discrete set with ordered or unordered values. Let 
Y, = Y,(@) denote the potential outcome if unit œ was assigned to the treatment level t. Let e,(X) = 
Pr{T = t|X} be the generalized propensity score (GPS). A generalized weak facilitating score can 
be defined as below. 


Definition 6 A generalized weak facilitating score a(X) is a q-dimensional (0 < q < p) function of 
X such that (i) X LL T |a(X) and (ii) E (Y, — Yp |a(X)) = E (Y, — Yp |X) for any t,t! € 3. 
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Figure 5: Parallel Box-Plots of (a) the Propensity Scores and (b) the Estimated ICE for Each of 
the Treatment x Stratum Combinations. The ‘0.k’ combination corresponds to individuals 
in Stratum k who did not attend the NSW program while ‘1.k’ corresponds to those in 
Stratum k who did, for k = 1,...,5. The width of each box has been made proportional 
to the sample size in each combination. 


Condition (ii) is equivalent to saying that E (Y, — Y,|a(X) = a) is independent of X. The following 
theorem provides a basis for its usage. It shows that, if the joint distribution of (Y, T) can be modeled 
through a vector-valued function h(X), then h(X) is a generalized weak facilitating score and direct 
estimates of causal effects can be obtained by conditioning on h(X) =h. 
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(b) Summary of Five Groups 

















NSW Group Unexposed Group Estimated Treatment Effect 
Node size mean sd size mean sd Propensity estimate s.e. 
1 22 8.1431 6.3676 156 4.8438 5.6728 12.36% 3.2993 1.3118 
2 7 5.4539 5.3997 186 9.7759 8.0259 3.63% —4.3221 3.0634 
3 71 4.6987 4.8043 55 4.8545 5.9303 56.35%  —0.1558 0.9564 
4 21 4.1514 4.3182 12 2.4027 5.5116 63.64% 1.7487 1.7283 
5 64 8.3825 11.0826 20 6.3210 7.1054 76.19% 2.0614 2.6383 





Table 5: Results for the five groups obtained from aggregated grouping: (a) correspondence be- 
tween the obtained groups and the five CIT terminal nodes; (b) summary statistics. The 
distance matrix was computed from 100 bootstrap samples and hierarchical clustering with 
average linkage was used for determining the final groups. 


Theorem 7 Assume that the conditional joint density of (Y,T) given X, fy,r\x(Y,T|X), can be 
written as fy r\x(¥,T |X) = g(Y,T,h(X)) for some function g(-). In other words, (Y,T) LL X|h(X). 
Further assume that treatment assignment is strongly ignorable so that Y, 1L lir=n |X for any 
t € S. When 0 < e;(X) < 1, we have 


(1). W(X) is a generalized weak facilitating score. 
(2). Concerning the causal effect in subpopulation Q} = {0 : h(X(@)) = h}, 
E(%,—Yelh(X)=h) = E{¥|T =1,h(X) =h} — EY) IT =1',h(X) =h} 
E{Y|T =t,h(X) =h}—E{Y|T =1',h(X) =h} 
is independent of X. 


The proof of Theorem 7 is deferred to Appendix A. As stressed by Lechner (1999) and Imbens 
(2000), GPS e,(X) does not have a causal interpretation. However, the reinforced assumption 
fy,r\x(Y,T|X) = g(¥,T,h(X)) implies that e,(X) can be fully characterized by h(X) or its com- 
ponents. This is analogous to the assumption of uniquely parameterized propensity function in Imai 
and van Dyk (2004), where a parametric form is prescribed for e;(X). To estimate h(X), a multi- 
nomial or cumulative logit model can be used for propensity and the outcome can be modeled with 
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multiple linear regression. The above results also can be extended to continuous treatment variables 
with arguments similar to Hirano and Imbens (2005). 


7. Discussion 


Embedded in Rubin’s causal model, we have introduced a new concept, the facilitating score, to 
help tackle the heterogeneity in both propensity and causal effects. The facilitating score is a finer 
balancing score of Rosenbaum and Rubin (1983), plus additional conditions for dealing with dif- 
ferential causal effects. It supplies a framework that promotes joint modeling of (Y,T) for a better 
understanding of causal effects. Accordingly we have devised recursive partitioning methods to aid 
in causal inference at different levels. 

The facilitating score concept and the CIT methods can be useful in personalized medicine and 
other similar applications. Medical treatment is traditionally centered on standards of care on the ba- 
sis of large epidemiological cohort studies or randomized trials that are powered for assessing ACE. 
These studies however do not account for variabilities of individuals in reacting to the treatments 
and drug-to-drug interactions. The new medical model of personalized medicine or treatments 
seeks flexible ways that allow for treatment decisions or practices being tailored to individual by 
integrating post-trial clinical data and new developments in biotechnology to improve healthcare. 
The collected covariates are often expanded to a more comprehensive consideration of the patient, 
including medical measurements, family history, social circumstances, environment and behaviors, 
and biological variables. As a result, the data are often observational and high-dimensional in na- 
ture. As demonstrated in the NSW data example, causal inference in observational studies could be 
very complex, owing to the confounding and interacting effects complicated by covariates. While 
personalized medicine is the ultimate goal, stratified medicine has been the current approach. Strat- 
ified medicine aims to select the best therapy for groups of patients who share common biological 
characteristics. The proposed CIT method and aggregated grouping can be used seeking strategies 
for deploying stratified medicines. Insight into a greater degree of personalized treatment can be 
gained by studying the personal treatment effects with ensemble CITs. 

Some limitations of the proposed methods are listed below. First, despite the usefulness of 
ICE, assessing ICE entails larger data than assessment of ACE in order to have the same level of 
precision (or variance). There are many trials in research practice that are only powered to detect 
ACE. For this reason, the proposed methods are best suitable for moderately-sized or large follow- 
up data collected in post trial periods or extracted from Medicare or Medicaid databases, in which 
randomization is not available. Secondly, the recursive partitioning methods are highly adaptive or 
data-driven in nature and often regarded as exploratory or hypothesis-generating. It is important to 
interpret the results with caution. In addition, the validity of Theorem 7 relies on the assumption of 
strong ignorability. Like other methods, CIT performance is vulnerable to violated assumptions and 
model misspecification. Shpitser and Pearl (2008) examines possibly milder conditions to ensure 
identifiability and facilitate estimation in causal inference. It would be interesting to investigate how 
to extend the proposed methods under mild conditions. 

In terms of future research, Theorem 7 is readily applied to data with binary outcomes. With 
further research efforts, both the facilitating score and CIT may be extended to other types of out- 
comes such as censored survival times or longitudinal measurements. It would also be interesting to 
extend the proposed methods to scenarios when both treatment and confounders are time-varying, as 
studied in marginal structural models and structural nested models (Robins, 1999), and when some 
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confounders are unmeasured but there exist some instrumental variables (IV; Angrist, Imbens, and 
Rubin 1996) that satisfy the strong ignorability and other conditions. In addition, Robins, Rotnitzky, 
and Zhao (1994) proposed doubly robust (DB) estimation methods to deal with mis-specification in 
either the response model or the propensity model. Along similar lines, the targeted maximum like- 
lihood (TML; van der Laan and Rubin 2006) is another newly developed causal inference method 
that enjoys a favorable theoretical property for being doubly robust and locally efficient, meaning 
that if at least one of the propensity and outcome models is correctly specified, then the TML es- 
timator is consistent and asymptotically normal; if both models are correctly specified it is also 
efficient. Similar work with facilitating score modeling could be another avenue for future research. 
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Appendix A. Proof of Theorem 7. 


We sketch the proof when T is ordinal or nominal. Theorem 3 follows as a special case when 
3 = {0,1}. Some steps are standard arguments in propensity score theories. We include them for 
the sake of completeness. 

First of all, the conditional probability density function of T|X is 


fx TIX) = f frx (WT XAY = f a(¥,7.W(X))a¥. 








Thus the GPS e,(X) = P(T = t|X) = gi (h(X)) for some function g;(-). Namely, h(X) is a finer 
function of e,(X). For this reason, we denote e,(X) = e;(h(X)). 
Next, since h(X) is measurable with respect to, o(X), the o-algebra generated by X, 





Pr{T =t|X,h(X)} = Pr{T =1|X,} =e,(X). 





Let 6, = JHT = t } be the indicator function of whether T = t. By iterated expectation, 


Pr{T = 1|h(X)} E(8,|h(X)) = E{E(6,|X,h(X))|h(X) } 


E{E(6;|X)|h(X) } = E{e,(X)|n(X) } = e:(X). 





Namely, Pr{T = t|X,h(X)} = e,(X) = Pr{T =1|h(X)}, which implies T LL X|h(X). 

Further assuming the treatment assignment is strongly ignorable given X, it follows that the 
treatment assignment is ignorable given h(X), that is, T LL Y,|h(X), which can be established by 
showing 





Pr{T =1'|Y,,h(X)} E{8; |Y, ,h(X)} = ELE (8X, ¥,,h(X))|¥,,h(X) } 
E{E(8|X)|¥,,h(X)} due to strong ignorability 


E{er(X)|¥;h(X)} = er (X) = P{T =1'h(X)}. 
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To check condition (ii) in Definition 6, consider E {Y;,|h(X)}. Since Y = $, Y,6, and 6,6, = 0 for 
t#t', we have Yô, = Y,6,. Consider 
EXY,;|h(X)} EXY,|h(X)} -E{8;|h(X) }/E {5;|h(X) } 
E{Y,6,|h(X)}/E{8,|h(X)} by strong ignorability 
E{Y68,|h(X)}/e,(X). 
It can be seen that h(x) is a finer function of both the numerate and denominator in the above 
expression. Thus E {Y,|h(X) = h} is fully determined by h and no longer relies on the value of X. 


Finally, in order to have available causal inference, it is important to note that, for given t and 
t', h(x) = h fully determines both e;(h) and ey (h). Therefore, 


E{Y,—Yy|h(X) =h} = E{Y,|h(X) =h,T =1,e,(h)} — £{¥)|h(X) =h,T =r',ey(h)} 
E{Y|h(X) =h,T =1} —E{Y|h(X) =h,T =7'} 


is independent of X. This justifies the direct use of mean response comparison for causal inference 
in subpopulationQ,. ff 


Appendix B. Proof of Proposition 4. 


First of all, condition (i) in Definition 2 holds as X LL T|h3(X). Assuming (Y;,Yo) LL T|X, it 
follows that (Y1, Yo) LL T |43(X) under strong ignorability. 
Now it suffices to verify condition (ii). Consider 


E{Y, |h2(X),h3(X)} 


E{Y,|T = 1,h2(X),h3(X)} 

EY |T =1,h2(X),h3(X)} 
E{E(Y|X,T = 1)|ho(X),h3(X)} 

= Efo +y +hı (X) +/2(X) | h2(X),h3(X)} 
= Yor y tho(X)+£ {hy (X)| M(X), h(X)} 


Similarly, it can be found that 
E{Yo|ho(X),43(X)} = Yo +E (hı (X)|h2(X),h3(X)}. 


Thus, 
E{Y — Yo |h2(X) = hz, h3(X) = h3} =Y1 +m, 


which is independent of X. J 


Appendix C. Proof of Theorem 5. 


The following lemma (see, e.g., Chapter 9 of Lin and Bai 2011), derived directly from C, inequality, 
will be used in the proof. 


Lemma 8 Given a sequence X,,...,X, of random variables, X, = ¥}_; Xi /n. Then 


= L 2 
E|\X,|" <—-) E|x;|" >1. 
Bell << LEXI frr 
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By condition (ii) in Definition 2 of a(x), 


(Fa — Fo) -EY —Yol@e) = (Fu — Yeo) — E (V1 — Yo|x) 
+E (Yı — Yola(x)) — E (Yı — Yo |ar) 
= {fu —E(%i|x) +E(M% |a(x)) -E (Yı |ac)} — 
{Fv — E(Yo|x) +E (Yo |a(x)) — E (Yo|ac) } 
= €:-% 


For convenience, we have used a, as shorthand for the conditioning event {a(x) = a,}. To prove 
(22), it suffices, by Minkowski’s inequality, to verify the mean square or Lz consistency for C; and 
Co separately. 

Consider 

G1 = u — E (Vix) } + {EM [a(x)) — EM lac) f, 

which has two terms. We examine the second term {E (Y; |a(x)) — E(Y|a;)} first. If assumptions 
(19), (20), and (21) hold, then a, & a(x) by Theorem 12.7 of Breiman et al. (1984, p. 322). Since 
E(Y,|a) is assumed continuous in a, 


E(Y;\a;) 2 E(Y,|a(x)) 


by the continuous mapping theorem. Moreover, since |E(Y;|a:)| < E(|Yi||a:) < E(|Y1|) < æ, it 
follows that 
lim E |E (Y1 |ā;) — E Yı la(x))|? =0 


by the dominated (or bounded) convergence theorem. 
Next, consider the first term in 61, {971 — E (Yı |x) }. Rewrite yzı as 


EIT EYT /m 


IET iET n 


V= Yn = Y T;/n zo 


IET iET 





which is a ratio estimator. The convergence of ratio estimators in the general form was studied by 
Cramér (1946). Using Theorem 12.7 of Breiman et al. (1984) again, we have &,, AVE (YT |x) and 
Pn ~ e(x) in probability. Thus 





én a EOTS _ 5 


a J) (Yi |x) 


in probability as well if e(x) 4 0, under the assumption of strong ignorability. To establish its mean 
square risk consistency, the necessary and sufficient condition is that the random sequence {9} }n is 
uniformly integrable, that is, 

lim supE {¥z1(Fe1 > co)} = 0. 

coe n 
A sufficient condition for uniform integrability (?) is that 


sup E|Fx|°** < o 
n 
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for some € > 0. This can be verified because 


1 
supE|ja tE <sup— Yo EYP <M<c 
á n Ul fiet: T=1} 


following from (19) and Lemma 8. 
Therefore, lim, E|C|* = 0 using Minkowski’s inequality again. Similar arguments can be used 
to show lim, E|Co|? =0. This completes the proof of Theorem 5. ff 
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